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Abstract 

Background: Open chromatin is implicated in regulatory processes; thus, variations in chromatin structure may 
contribute to variations in gene expression and other phenotypes. In this work, we perform targeted deep 
sequencing for open chromatin, and array-based genotyping across the genomes of 72 monozygotic twins to 
identify genetic factors regulating co-twin discordance in chromatin accessibility. 

Results: We show that somatic mutations cause chromatin discordance mainly via the disruption of transcription 
factor binding sites. Structural changes in DNA due to C:G to A:T transversions are under purifying selection due to 
a strong impact on chromatin accessibility. We show that CpGs whose methylation is specifically regulated during 
cellular differentiation appear to be protected from high mutation rates of 5'-methylcytosines, suggesting that the 
spectrum of CpG variations may be shaped fully at the developmental level but not through natural selection. 
Based on the association mapping of within-pair chromatin differences, we search for cases in which twin siblings 
with a particular genotype had chromatin discordance at the relevant locus. We identify 1,325 chromatin sites that 
are differentially accessible, depending on the genotype of a nearby locus, suggesting that epigenetic differences 
can control regulatory variations via interactions with genetic factors. Poised promoters present high levels of 
chromatin discordance in association with either somatic mutations or genetic-epigenetic interactions. 

Conclusion: Our observations illustrate how somatic mutations and genetic polymorphisms may contribute to 
regulatory, and ultimately phenotypic, discordance. 



Background 

Open chromatin provides access for a wide spectrum of 
DNA binding proteins to regulate transcription, DNA 
repair, recombination, replication, and so on. As such, 
open chromatin profiling has been used to identify the 
genomic locations of various regulatory regions, includ- 
ing promoters, enhancers, insulators, silencers, replica- 
tion origins, and recombination hotspots [1-4]. The 
binding sites for transcription factors (TFs) have been 
extensively profiled based on the distribution of sequen- 
cing tags derived from DNase I hypersensitive sites [5,6]. 
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The FAIRE (formaldehyde-assisted isolation of regula- 
tory elements) technique has also been used to capture 
open chromatin regions in the genome [2,7-11]. 

Chromatin accessibility has been a focal point in the 
studies exploring the intersection of genetics and epigenet- 
ics. Meanwhile, coupling between chromatin accessibility 
and underlying genetic polymorphisms renders chromatin 
status a heritable feature [12]. In a recent study [13], asso- 
ciation mapping was used to understand the genetic basis 
of chromatin regulation, and in our previous work [11], 
we made a similar attempt based on the genetic linkage 
of FAIRE signals. It was also demonstrated that disease- 
associated regulatory variations could be mapped to 
FAIRE regions [9] or DNase I hypersensitive sites [14]. 
Furthermore, a method based on FAIRE DNA genotyp- 
ing for the systematic identification of regulatory poly- 
morphisms associated with different phenotypes has 
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also been introduced [10]. However, how somatic muta- 
tions are able to affect chromatin accessibility still re- 
mains to be elucidated. 

While monozygotic (MZ) twins are useful for studying 
epigenetic differences caused by differential environmental 
exposure, DNA methylation is the only epigenetic mech- 
anism that has been studied in depth with regard to MZ 
discordance [15-17]. A recent study [18] showed that 
DNA methylation could function as an intermediary of 
genetic factors associated with particular traits or pheno- 
types. However, the modulation of chromatin structure is 
central to epigenetic regulation, and chromatin accessibil- 
ity in particular can be linked directly to transcriptional 
activity [1,13] through a combination of multiple epigen- 
etic mechanisms, including DNA methylation. Yet despite 
the importance of open chromatin in transcriptional regu- 
lation, MZ discordance in chromatin accessibility has 
never been explored. 

In this work, we investigated co-twin chromatin dis- 
cordance and the associated genetic factors. We first 
sought to obtain the full spectrum of somatic and genetic 
single nucleotide variations that underlie open chromatin. 
To this end, we compared the patterns of somatic muta- 
tions (that is, within-pair sequence differences) with those 
of genetic polymorphisms (that is, between-pair sequence 
differences). In addition, we attempted to characterize 
genetic-epigenetic interactions by finding genetic poly- 
morphisms that influence within-pair differences in chro- 
matin accessibility. Thus, we performed quantitative trait 
loci (QTL) mapping for quantitative within-pair chroma- 
tin differences across twin pairs. 

Results and discussion 

We performed high-quality, in-depth open chromatin 
sequencing (approximately 72x, with 92% of Q >30 
bases; Additional file 1) for 36 pairs of MZ twins. We se- 
lected twins aged between 30 and 60 years who were dis- 
cordant for immunological traits, mostly involving allergic 
symptoms (Additional file 2), so that more somatic differ- 
ences in immune cells could be observed. Normalized 
chromatin accessibility was determined for the identified 
open chromatin regions. Chromatin accessibility showed a 
higher correlation between twin siblings than between un- 
related individuals (Additional file 3). Our data showed 
good agreement with the public FAIRE and DNase I data 
in GM12878 lymphoblastoid cells (Additional file 4). To 
assess the accuracy of our sequencing-based variant detec- 
tion method, we employed Affymetrix 6.0 SNP arrays to 
genotype peripheral blood DNA. A concordance rate of 
approximately 76% was observed when lower confidence 
variants were included to increase the overlap with the 
array data (see Materials and methods; Additional files 5 
and 6). Somatic regulatory mutations, defined as variants 
that resided in open chromatin and significantly differed 



between twin siblings, were identified and confirmed to 
an unprecedented level thanks to targeted deep sequen- 
cing of open chromatin in many individuals. Mean- 
while, regulatory polymorphisms (SNPs) were defined 
as variants that showed a minimum allele frequency of 
1% among all the FAIRE reads across all samples (see 
Materials and methods). 

The identified mutations were coupled to increases 
in the level of discordance in chromatin accessibility 
(Figure 1A) and therefore to increases in the likelihood 
of discordance at a given locus (Figure IB). Mutations 
that decrease chromatin accessibility were more fre- 
quent than mutations that increase chromatin accessi- 
bility (Figure 1C). The impact of the mutations was 
stronger when they were located closer to the center of a 
region of open chromatin (Additional file 7) and when 
their density relative to the size of the open chromatin re- 
gion (that is, number of mutations per base pair) was 
higher (Additional file 8). We then matched our mutation 
and chromatin discordance results with chromatin status 
modeling data from multilayer chromatin signatures such 
as open chromatin, histone modifications and TF binding 
[19]. Remarkably, poised promoters, more than any other 
regulatory states, exhibited the highest mutation rates 
(Figure ID) and the highest discordance levels (Figure IE), 
although they had the smallest site number and length. 
Importantly, a higher magnitude of mutational effects 
was observed when TF binding sites (TFBSs) located 
within open chromatin regions were disrupted (Figure IB). 
Changes in TF binding have been shown to be a pri- 
mary mechanism through which DNA polymorphisms 
can affect chromatin structure [11,13]. We found exam- 
ples in which mutations in one twin sibling disrupted 
the binding motifs of key TFs implicated in immune 
cell development and function (ETS1, ELF1, PAX5, and 
RUNX) and decreased chromatin accessibility in an 
allele-specific manner (Figure 2). 

Our next concern was whether somatic mutation-derived 
chromatin discordance is associated with perturbations in 
gene expression. For that, we conducted expression 
microarray analyses and calculated the genome-wide 
expression divergence between twin siblings. We first 
compared the within-pair expression difference of genes 
showing promoter chromatin discordance with that of 
genes showing no chromatin discordance. Our results 
supported the significant effect of chromatin discordance 
on differential gene expression (Figure 3A). A rank cor- 
relation between high, medium, and low magnitudes of 
chromatin discordance and high, medium, and low levels 
of differential gene expression was statistically significant 
as assessed based on permutation tests (Figure 3B), 
thereby indicating that high accessibility tends to direct 
high expression levels. We then identified genes with so- 
matic mutations that disrupted the TF motif and induced 
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Figure 1 Effect of somatic mutations on chromatin regulation. (A) Within-pair differences in chromatin accessibility based on the presence 
and absence of mutations. (B) Correlation of chromatin discordance with the presence of mutations in open chromatin and in TFBSs as predicted 
based on Transfac position weight matrices, motif enrichment analysis by Homer, and ChlP-seq data in GM12878 lymphoblastoid cells. (C) Correlation 
of chromatin discordance with mutation types. (D,E) Within-pair differences in mutation rates (D) and chromatin accessibility (E) according to different 
regulatory states. 



chromatin changes at proximal sites. TF-binding chroma- 
tin immunoprecipitation sequencing (ChlP-seq) data were 
incorporated simultaneously to confirm the binding of 
TFs to the corresponding motifs. We then found that 
genes with disrupted TF motifs exhibited a larger degree 
of differential expression than controls, when all genes 
were used as controls (Figure 3C). This gap in differential 
expression compared with the control was most stri- 
king when TF motifs were confirmed by the ChIP data 
(Figure 3C). These findings indicate that somatic muta- 
tions in proximal accessible chromatin regions can disrupt 
TF binding at the respective motifs and alter chromatin 
accessibility, which can influence the transcription of the 
connected genes. 

Somatic C:G > A:T transversions were observed as fre- 
quently as C:G > T:A transitions (red bars in Figure 4A). 



However, the occurrence of C:G > A:T polymorphisms 
was remarkably lower than that of the corresponding 
mutations (blue bars in Figure 4A), implying the influence 
of negative selection. The loss of C or G bases, particularly 
through changes that greatly alter their chemical struc- 
ture, namely transversions, may exert large functional ef- 
fects. Indeed, C:G > A:T transversions represented the 
base substitutions that were most commonly associated 
with chromatin discordance and inter-individual variation 
(Additional file 9). This likely explains why these muta- 
tions are subject to negative selection. 

CpGs were substantially enriched in TFBSs relative to 
their overall frequency in the surrounding open chromatin 
(Additional file 10) or in the whole genome (Additional 
file 11). Nevertheless, CpGs were much less frequently 
mutated or polymorphic than expected (Figure 4B). The 
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Figure 2 Examples of mutations that decrease chromatin accessibility by disrupting TF recognition motifs. Shown below the accessibility 
signals from the MZ twins are the DNase I clusters, TF binding signals, conserved TFBSs and histone modification patterns derived from the 
ENCODE data in GM12878 lymphoblastoid cells. 



observed-to-expected ratio was approximately 0.2 for both 
mutation and polymorphism, which is approximately 
four-fold lower than the mean of the non-CpG group. We 
suspected that cytosine methylation might contribute 
to the lower than expected CpG substitution rates; 
therefore, we obtained the methylation levels for each in- 
TFBS CpG in human embryonic stem cells, hematopoietic 
progenitor cells, and mature cells from different lineages 
(neutrophils and lymphocytes) [20] and then estimated 
the changes in cytosine methylation that occur during 
B cell development. When comparing lymphocytes to 
other cell types, non-substituted CpGs showed a larger 



magnitude of differential methylation; however, mutated 
and polymorphic CpGs displayed no significant methyla- 
tion changes (Figure 4C). Similar patterns were obtained 
when the average TFBS methylation levels were used 
for comparison (Additional file 12). The role of DNA 
methylation in controlling differentiation-associated TF 
binding has been demonstrated previously [21,22], There- 
fore, we suggest that TFBS CpGs whose methylation is 
specifically regulated during cellular differentiation main- 
tain low mutation rates due to negative selection acting 
within organisms. There were no distinct signs of predom- 
inant 5-methylcytosine mutations (that is, mCpG > TpG; 
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Figure 3 Effect of chromatin discordance and somatic mutation on differential expression. (A) Within-pair differential expression according 
to proximal chromatin accessibility. (B) The statistical significance of a rank correlation between high, medium, and low magnitudes of chromatin 
discordance and high, medium, and low levels of differential gene expression, as assessed based on 10,000 permutations of differential expression 
and chromatin discordance. (C) Differential expression according to genes with broken TF motifs and relevant ChIP evidence compared with all 
genes as control. 
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Figure 4D), suggesting that the functional importance of 
regulatory CpGs can counteract the high mutation rate 
of 5-methylcytosines. The differences in methylation 
levels between the differentiated cell types (B cells versus 
neutrophils) showed weaker effects (Figure 4C; Additional 
file 12), highlighting the importance of methylation regu- 
lation during early stages of development Taken together, 
these findings suggest that the landscape of CpG varia- 
tions may be shaped by developmental, rather than evolu- 
tionary, processes. 

In this work, sequencing reads were derived only from 
open chromatin regions, thereby enabling in-depth se- 
quencing. Moreover, the functional validity of the puta- 
tive mutations was supported by the relevant chromatin 



and transcription changes. A completely independent 
dataset based on a different experimental approach (DNase 
I hypersensitivity), SNP calling (HapMap genotyping and 
1000 Genomes sequencing), and population (Yoruba) 
[13] confirmed selective constraints against transver- 
sions (Additional file 13), against changes in GpC dinu- 
cleotides (Additional file 14), and against substitutions 
at differentiation-associated CpG methylation sites 
(Additional file 15). Furthermore, the overall consistency 
between the spectra of the mutations and polymorphisms 
suggests that the majority of the mutations identified in 
the examined cell lines reflect in vivo characteristics. 

The concept of Variability genes' was suggested based on 
the finding that the Kidd blood group locus is associated 
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with within-pair differences in the total cholesterol level 
when serum lipid levels were examined across twin pairs 
with different genotypes [23]. Therefore, variability genes 
refer to genotypes that are associated with the variance of a 
trait rather than with the level of a trait, thereby implying 
that within-pair variability in MZ twins can be used to 
study genetic-epigenetic interactions [24]. In this regard, 
we attempted to identify genetic polymorphisms that are 
shared by twin siblings and are associated with within-pair 
differences in chromatin accessibility. In other words, we 
sought to find the cases in which certain chromatin sites 
are more differentially accessible between twin siblings 
who share a particular allele than between other siblings 
with different alleles. 

To this end, we performed QTL mapping by associat- 
ing the within-pair differences in chromatin accessibility 
with the genotypes shared by each twin pair as deter- 
mined using the Affymetrix SNP arrays. As previously 
suggested [23], normalized differences in chromatin 



accessibility were used instead of absolute differences 
(see Materials and methods) to rule out the possibility 
that the trait level itself is reflected in the degree of dif- 
ference. Because within-pair chromatin differences are 
very low in most cases, the chromatin loci with the high- 
est between-pair variances in chromatin discordance (for 
example, the top 1%) were selected and used for QTL 
mapping. At a false discovery rate (FDR) of 0.01, a total 
of 10,195 local (cis) associations were identified for 1,325 
chromatin loci (Figure 5A). In particular, poised pro- 
moters and active promoters showed the highest levels 
of enrichment for such associations (Figure 5B), which 
was similar with the case of somatic mutations. Within- 
pair differences in chromatin states can be caused by 
epigenetic factors, such as histone modifications or the 
expression levels of chromatin regulators or TFs, that 
reflect different histories of environmental exposure 
between twin siblings. These epigenetic differences will 
be exposed by differential TF binding in one sibling in 
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Figure 5 Interplay between genetic and non-genetic factors causing chromatin discordance. (A) Quantile-quantile plot of P values for local 
(<1 Mb) associations from the QTL mapping of chromatin discordance. The 5 th and 95 th percentiles of the beta distribution of the lvalues is 
shown (gray shading). Significant associations (FDR = 0.01) are denoted in blue. (B) The percentage of chromatin-state domains harboring the 
chromatin regions that are discordant depending on a local genotype. The number of such domains was divided by the total number of the 
domains containing the open chromatin regions used in our QTL mapping. 



the form of differential chromatin accessibility but be 
masked by genetically low TF binding to the TFBS in 
the other sibling. 

Notably, poised promoters presented high levels of 
chromatin discordance between twin siblings (Figure IE) 
either in association with a high frequency of somatic 
mutations (Figure ID) or via the mechanisms of the 
variability genotypes (Figure 5B). Poised transcriptional 
elements are critical in defining cellular responses to en- 
vironmental or developmental cues. In the context of 
cellular differentiation, bivalent histone modifications 
(for example, H3K4me3 and H3K27me3) may silence 
developmental genes in embryonic stem cells while 
keeping these genes poised for activation during later 
stages of development [25-27]. Our findings suggest that 
promoters that are currently inactive but are poised to 
be activated following particular stimuli exhibit high 
levels of chromatin discordance, due to either somatic 
mutations or genetic-epigenetic interactions, and are re- 
sponsible for differential regulatory responses to envir- 
onmental or developmental cues in twin siblings. 

By leveraging MZ twin samples, we were able to identify 
somatic mutations as within-pair genetic differences and 
provide new insight into their effect on chromatin accessi- 
bility for the first time. Previous studies concentrated on 
genetic polymorphisms residing in accessible chromatin 
[9-14]. Three recent papers [28-30] went further to con- 
firm a causal role of polymorphisms in modulating epigen- 
etic and transcription mechanisms through the regulation 
of TF binding. We expected that, from a mechanistic 
viewpoint, somatic mutations would affect transcription 
in a similar manner to polymorphisms. Indeed, we dis- 
covered that somatic mutations could lead to differen- 
tial gene expression through chromatin discordance 
and TF motif disruption. The differences between 



mutations and polymorphisms also provided new insight. 
For example, the critical substitution, namely C:G > A:T 
transversion, was less enriched for polymorphisms than 
for somatic mutations, implying the action of negative 
selection. 

Previous twin studies primarily focused on differences in 
DNA methylation [15-17]. Our study is the first to profile 
genetic differences between multiple MZ twins and exam- 
ine their linkage with chromatin discordance and differen- 
tial transcription. We expect that follow-up twin studies 
will also address genetic discordance in association with 
the epigenetic differences in question. Heterogeneity in the 
cell population may have affected our detection of genetic 
and epigenetic discordance between MZ twin siblings. 
However, mutation presence and chromatin discordance 
were associated based on the same FAIRE-seq data of the 
same cell population. The genetic and associated epigen- 
etic differences should have arisen at the identical lineage 
depth of cell differentiation, thereby mitigating the hetero- 
geneity problem in the context of genetic-epigenetic asso- 
ciation. We also found cases in which somatic mutations 
in TFBSs altered chromatin accessibility between twin 
pairs with discordant allergic traits. Several of the TFs in 
these cases were reported to have immunological func- 
tions; therefore, we speculated that disruptions in the bind- 
ing of immune-related TFs could bring about differences 
of chromatin accessibility and allergic phenotypes. This 
raises the possibility that chromatin discordancy related to 
somatic mutations could affect phenotypic diversity, but 
identifying a direct connection between chromatin accessi- 
bility and the target trait requires further study. 

Conclusions 

In this work, we performed targeted deep sequencing 
for open chromatin and gene expression microarray 
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experiments across the genomes of a fairly large number 
of MZ twins. Our integrative analyses identified somatic 
mutations causing chromatin discordance mainly via the 
disruption of TFBSs. The spectrum of somatic and gen- 
etic sequence variations underlying discordant open 
chromatin suggested selection pressures against C:G > A: 
T transversions. Of note, CpGs located in TFBSs were 
found to be mutated or polymorphic only one-fifth as 
often as expected. In particular, those CpGs whose 
methylation is specifically regulated during cellular dif- 
ferentiation appeared to be protected from the high mu- 
tation rates associated with 5'-methylcytosines, thereby 
implying that the spectrum of CpG variations may be 
shaped fully at the developmental level but not through 
natural selection. Our association mapping of within- 
pair chromatin differences identified cases in which only 
twin siblings with a particular genotype exhibited chro- 
matin discordance at the relevant locus, demonstrating 
that epigenetic differences can bring about regulatory 
variations through interactions with genetic factors. Re- 
markably, poised promoters presented high levels of 
chromatin discordance in association with either somatic 
mutations or genetic-epigenetic interactions, reflecting 
their role in inducing differential regulatory responses to 
environmental or developmental cues in twin siblings. In 
conclusion, our observations illustrate how somatic mu- 
tations and genetic polymorphisms may contribute to 
regulatory, and ultimately phenotypic, discordance. 

Materials and methods 

MZ twin samples 

This study was approved by the Institutional Review 
Board of the Korea National Institute of Health (KNIH) 
and Korea Advanced Institute of Science and Technol- 
ogy (KAIST). Written informed consent was obtained 
from all individuals. Peripheral blood mononuclear cells 
were extracted from 36 pairs of MZ twins, aged 30 to 
60 years, who were discordant for immunological traits 
mostly involving allergic symptoms (Additional file 2). 
Lymphoblastoid cells were generated as previously de- 
scribed [31]. 

Open chromatin sequencing 

Sequencing of accessible chromatin regions was per- 
formed as described previously [2,7-9]. Following quality 
control of the DNA library, 50-bp single-end Illumina 
HiSeq2000 sequencing was performed. CASAVA- 1.8.2 
was used to align sequence tags to the reference human 
genome (hgl9) using the default option, which disal- 
lowed mismatches. The mean Phred quality (Q) score of 
the bases in the passed-filter reads was averaged at 36.2 
across the 72 samples (Additional file 1). The average 
percentage of the bases with Q > 30 in the passed-filter 
reads was 91.9% (Additional file 1). 



Identification of open chromatin regions 

To identify genomic regions enriched for open chroma- 
tin tags, we employed the Zero Inflated Negative Bino- 
mial Algorithm (ZINBA) [32]. Tags were extended 
134 bp as recommended by the ZINBA program for the 
prediction of open chromatin peaks. An FDR threshold 
of 0.05 was used. For covariates, the G/C content, the 
proportion of mappable bases (mappability score), and 
the local background estimate were taken into account. 
Open chromatin regions were reported to encompass 1 
to 2% of the genome [32]. In line with this, the peaks we 
identified covered 2.01% of the hgl9 genome on average. 
The number of tags mapped to these regions is summa- 
rized in Additional file 1. The effective sequencing depth 
was computed by dividing the total number of bases 
read by the tags mapped to the identified peaks (the 
number of reads multiplied by the read length) by the 
total base pairs covered by the peaks (that is, 2.01% on 
average). For confirmation, we obtained genomic regions 
enriched for DNase I hypersensitivity as identified by 
F-Seq [33] and those enriched for FAIRE signals as 
called using ZINBA [32] from the ENCODE Open Chro- 
matin Synthesis track of the UCSC Genome Browser 
[34] for the GM12878 lymphoblastoid cell line. 

Quantification and normalization of chromatin 
accessibility 

For the quantification of chromatin accessibility, the 
open chromatin peaks identified from each of the 72 
samples were merged into meta-peaks using BEDTools 
[35]. For each sample, the accessibility signals were nor- 
malized as previously suggested [36,37]; the number of 
sequenced reads mapped to each meta-peak was 
counted and divided by the length of the meta-peak, 
which was then calibrated by the ratio of the total read 
count to the genome size. This metric measures the en- 
richment of tags within a given open chromatin locus 
relative to the whole genome. We also employed a 
quantile-quantile normalization method to remove the 
effect of between-sample variations by using the Quan- 
tile Normalization package of R. Chromatin discordance 
was calculated for each locus as the quantitative differ- 
ence between the normalized chromatin accessibility 
measures of twin siblings. The absolute difference, \X X - 
X 2 \i was also obtained, wherein X 1 and X 2 indicate chro- 
matin accessibility in each sibling of a twin pair. While 
this provides a direct measure of chromatin differences, it 
can cause a bias toward highly accessible chromatin sites 
such as active promoters or strong enhancers. Therefore, 
we normalized this measure as \X X - X 2 \/{X 1 +X 2 ) and 
used the normalized difference particularly when compar- 
ing different regulatory states. Discordant chromatin sites 
were determined based on effect size (that is, when the 
within-pair fold change was greater than four) or based on 
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variation analysis (that is, when the within-pair variance in 
chromatin accessibility was greater than the between-pair 
variance). 

Identification of mutations and polymorphisms 

Somatic mutations were defined as within-pair single 
nucleotide differences. We employed the method sug- 
gested by VarScan [38,39] for the statistical test of the 
number of aligned reads supporting each allele. The al- 
lele frequency obtained at every mapped position was 
subjected to Fishers exact test. To determine the direc- 
tion of base changes, we considered the cases in which 
one sibling in a given pair is a reference homozygote. 
Only high-confidence calls by VarScan were considered. 
We further raised the threshold by selecting those sup- 
ported by 10 or more high-quality aligned reads on both 
strands and with a P value from Fisher s exact test lower 
than 1 x 10~ 3 . For the identification of genetic polymor- 
phisms (SNPs), we employed the germline variant- 
calling functionality of VarScan to analyze variants that 
were not significantly different between twin siblings 
(that is, ones that did not pass the mutation identifier). 
Variant calls were filtered when >90% of the supporting 
reads in each sample came from only one strand in 
order to remove alignment-related artifacts. We consid- 
ered autosomal genomic positions that were covered by 
at least one read in all of the 72 samples. To be identi- 
fied as a polymorphism, a minimum allele frequency of 
1% among all the reads across the samples was required. 
Association mapping was conducted between the identi- 
fied SNPs and the normalized accessibility signals across 
the 72 samples based on simple linear regression by 
using the scipy.stats module of SciPy. Significant associa- 
tions were chosen at the FDR of 0.05 with P value ad- 
justment by using the fdrtool package of R. 

Array genotyping and validation 

Genomic DNA (500 ng) was isolated from the peripheral 
blood cells of one sibling from each twin pair and geno- 
typed on the Affymetrix Genome-Wide Human SNP 
array 6.0. Low-concentration genomic DNA was ampli- 
fied prior to genotyping according to the manufacturer s 
protocol (QIAGEN). The Bayesian Robust Linear Mod- 
eling using Mahalanobis Distance (RLMM) algorithm 
was used to perform genotype calling of 516,188 SNPs 
[40]. SNP imputation was carried out using IMPUTE 
(version 2) as described previously [19,41] on the basis 
of the genotypes of 286 Asian individuals from the 1000 
Genomes Project (CHB [ASN] (n = 97) - Han Chinese in 
Beijing, China; CHS [ASN] (n = 100) - Han Chinese 
South; JPT [ASN] (n = 89) - Japanese in Tokyo, Japan) as 
a reference panel. Validation was performed for auto- 
somal chromosomes to check whether our genotyping 
based on open chromatin sequencing would reproduce 



the microarray genotypes. To assess the effect of tech- 
nical issues such as possible biases in the allele frequen- 
cies of open chromatin sequences, we interrogated the 
available array data to determine whether the genotype 
of each individual is a reference homozygote or variant 
homo-/hetero-zygote for each locus of the identified 
mutations or polymorphisms from the open chromatin 
sequencing. For polymorphic loci, the genotypes of all 
available individuals were examined. For the mutation 
loci, the available array genotype of one of the relevant 
siblings was compared with the sequencing-based geno- 
type when the mutant allele matched one of the common 
alleles on the array. To increase the overlap between the 
platforms, we used all of the high-confidence mutations 
without using a P-value threshold. Sixty-six percent of 
the individual genotypes at individual loci were repro- 
ducible (Additional file 5). Among the unmatched cases 
(Additional file 6) were those in which the array genotype 
indicated a homozygous variant (for example, A/ A) while 
the open chromatin sequences seemed to support a het- 
erozygous variant genotype (for example, A/G). Because 
the sequencing data support the existence of the reference 
allele (that is, G), it is most likely that the reference allele 
was not detected via array hybridization. Including those 
cases (denoted in blue in Additional file 6) in the matched 
list increased the concordance rate to 68.6%. There may 
also be cases in which the subject is actually a heterozy- 
gous variant (for example, A/G) when the captured open- 
chromatin sequences might be biased toward the allele 
that increases chromatin accessibility (that is, A), leading 
to an apparently homozygous variant genotype (that is, 
A/ A). These instances are shown in red in Additional 
file 6. Because we treated variant homozygotes and het- 
erozygotes equally as one instance of variation, this type 
of mismatch did not affect the spectrum of nucleotide var- 
iations that we examined in this work. Permitting these 
cases increased the concordance rate to 73.6%. Taken 
together, the overall concordance of variant detection 
was 76.2%. It should be noted that, for validation pur- 
poses, we did not use P-value filtering. For our main ana- 
lysis, we further filtered mutations using a P-value threshold 
of 1 x 10" 3 . 

Dinucleotide counting and typing 

Complementary base pairs were concatenated by V (for 
example, A:T or C:G) and dinucleotides were connected 
by p' (for example, CpG, GpC, and so on). For the typ- 
ing of mutations and polymorphisms arising at different 
dinucleotides, we extracted three consecutive bases cen- 
tered on each of the identified variants (mutations or 
polymorphisms) using BEDTools [35] and counted the 
dinucleotide composed of the -1 base and the variant 
(-lpV) alongside the dinucleotide composed of the vari- 
ant followed by the +1 base (Vp + 1). The number of 
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dinucleotides themselves was counted in the same man- 
ner for all different trinucleotides. Changes on one 
strand were treated equally with complementary changes 
on the opposite strand. For example, the change from 
the trinucleotide GpApA (reference) to GpCpA (variant) 
was treated as 5-GpA-3' to 5-GpC-3' on one strand 
or 5-TpC-3' to 5-GpC-3' on the other strand, and 
simultaneously as 5'-ApA-3' to 5-CpA-3' on one 
strand or 5-TpT-3' to 5-TpG-3 on the other stand. 
The number of observed substitutions was counted for 
each type of dinucleotide by combining all possible 
changes from the dinucleotide type (for example, for 
ApA, the instances of ApA > CpA, ApA > GpA, ApA > 
TpA, ApA>ApC, ApA>ApG, ApA > ApT were com- 
bined). By following the above procedure, the number of 
palindromic dinucleotides (that is, CpG, GpC, ApT, and 
TpA), as well as the occurrence of mutations and poly- 
morphisms in such dinucleotides, was double-counted 
to provide fair comparison with the other dinucleotides. 

Expression microarray analysis 

Total RNA was extracted from lymphoblastoid cells 
using RNeasy Mini Kit (QIAGEN, Chatsworth, CA, 
USA) according to the manufacturers instructions. The 
yield and purity of the extracted RNA were evaluated by 
A260/A280 measurement and agarose gel electrophor- 
esis. Gene expression levels were measured for the 72 
MZ twin samples on the Illumina Human HT-12 chip. 
The raw intensities were log2 transformed and normal- 
ized by using the VST-quantile normalization method of 
the lumi R packages (v. 14.01). 

Transcription factor binding site data 

A total of approximately 4 million evolutionarily con- 
served binding sites for 250 TFs, as inferred from the 
Transfac Matrix Database (v7.0) [42,43], were obtained 
from the Human/Mouse/Rat Conserved Transcription 
Factor Binding Sites track of the UCSC Genome 
Browser. The sites of in vivo TF binding were obtained 
from the ENCODE Transcription Factor Binding tracks. 
All data available for the GM12878 cells were generated 
either by the Hudson Alpha Institute for Biotechnology 
(HAIB) or by Stanford/Yale/USC/Harvard (SYDH). We 
first discovered regions enriched for TF binding by using 
the peak finding functionality of the Homer package 
[44]. We then located the peak summit as the position 
at which the maximum number of ChlP-seq tags over- 
lapped within the given region. ChlP-seq peaks in which 
less than 80% of the sequencing tags covered the peak 
summit were discarded. In this manner, we were able to 
select peaks that were most likely to contain the focused 
binding site of a single TF. However, this set of TFBSs 
was not used for mutation typing because the actual 
binding motifs of TFs are hard to predict based on the 



library fragments, which are only 200 bp on average. 
Additionally, we searched the open chromatin peaks for 
enriched TF motifs by using the find MotifsGenome 
function of Homer [44]. Homers library of known 
motifs was screened against the target and background 
sequences for enrichment, and motifs enriched with a 
P < 0.05 based on the binomial distribution were se- 
lected. For de novo motif finding, motifs of 8, 10, and 
12 bp in length were screened for enrichment. 

Dinucleotide enrichment in transcription factor binding sites 

To measure the relative enrichment of each dinucleotide 
in TFBSs, compared with background sequences, we 
counted each dinucleotide located in a TFBS and its sur- 
rounding open chromatin region as described above. 
The total occurrence of the dinucleotide in a TFBS was 
divided by its occurrence in the surrounding open chro- 
matin regions. To control for differences in the sizes of 
TFBSs and open chromatin regions, the number of all 
the dinucleotides in the TFBSs was divided by the num- 
ber of all the dinucleotides in the surrounding open 
chromatin regions. The ratio obtained for each specific 
dinucleotide was then divided by the ratio obtained for 
all possible dinucleotides. This ratio of ratios then served 
as the measure of the relative enrichment of dinucleo- 
tides in TFBSs (Additional file 10). We also repeated this 
test using the whole genome as the background, instead 
of just the open chromatin regions, to adjust for the dif- 
ferential frequencies of different dinucleotides across the 
whole genome (Additional file 11). We then repeated 
the same procedure for mutated dinucleotides and poly- 
morphic dinucleotides. Specifically, the relative enrich- 
ment of substituted dinucleotides in TFBSs was obtained 
as (Number of given substituted dinucleotides in TFBSs/ 
Number of given substituted dinucleotides in surround- 
ing open chromatin) /(Number of all substituted dinucle- 
otides in TFBSs/Number of all substituted dinucleotides 
in open chromatin). This ratio was divided by the relative 
enrichment of the dinucleotides themselves as described 
above, namely as (Number of given dinucleotides in 
TFBSs/Number of given dinucleotides in surrounding 
open chromatin) /(Number of all dinucleotides in TFBSs/ 
Number of all dinucleotides in open chromatin). This 
metric was used to estimate the observed-to-expected ra- 
tio of mutations or polymorphisms. 

Quantitative trait loci mapping 

The normalized chromatin difference, \X X - X 2 \/(X 1 +X 2 ), 
was obtained using XI and X2 to indicate the chromatin 
accessibility in each sibling of a twin pair. Because the 
within-pair chromatin differences are very low in most 
cases, the chromatin loci with the highest between-pair 
variances in chromatin discordance (that is, the top 1%) 
were selected. We used the genotypes obtained from the 
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Affymetrix Genome- Wide Human SNP array 6.0, which 
were imputed as described above, and filtered at the 
minor allele frequency of 1%. A simple linear regression 
model from the Matrix eQTL package was applied [45]. 
G's-associations were defined as a distance of less than 
1 Mb between the genetic marker and the associated 
chromatin site. 

Other data 

Cytosine methylation levels were derived from whole- 
genome shotgun bisulfite sequencing data from human 
embryonic stem cells, hematopoietic stem/progenitor 
cells, B lymphocytes, and neutrophils [20]. Data for dif- 
ferent regulatory states in the GM12878 lymphoblastoid 
cell line were downloaded from the Chromatin State 
Segmentation by HMM from the ENCODE/Broad track 
of the UCSC Genome Browser. DNase I hypersensitive 
sites in 70 Yoruba lymphoblastoid cell lines [13] were 
obtained from [46]. The genotypes given by the 1000 
Genomes YRI (Yoruba in Ibadan) high coverage cohort, 
YRI HapMap panel, and 1000 Genomes YRI low coverage 
cohort were used in the order of preference depending on 
data availability [13]. The combined final genotype for 
each locus in each individual was obtained from [47]. We 
selected SNPs falling within DNase I hypersensitive sites 
in each sample and performed polymorphism typing as 
described above. 

Data access 

The expression profile and FAIRE-seq data from this 
study have been submitted to the NCBI Gene Expression 
Omnibus under accession numbers [GEO:GSE53822 
and GSE44742]. 

Additional files 



Additional file 1: Information of the MZ twin FAIRE-seq data. 

Additional file 2: Clinical information of the MZ twins used in 
this work. 

Additional file 3: Density plot for genome-wide correlation coefficients 
of chromatin accessibility between twin siblings and between unrelated 
individuals. 

Additional file 4: The number of open chromatin regions we 
identified that overlap with peaks identified from the public 
FAIRE-seq and DNase-seq data for the GM12878 lymphoblastoid 
cells. Our FAIRE regions were extended 10 to 200 bp before overlapping. 
Approximately 83% of the FAIRE regions that we found to overlap with 
the public FAIRE data were confirmed by the public DNase I data. 

Additional file 5: List of sequencing variants whose genotype is 
confirmed by Affymetrix SNP array 6.0. 

Additional file 6: List of sequencing variants whose genotype is in 
conflict with the Affymetrix SNP array 6.0 genotype. Red/blue 
coloring indicates cases in which a sequencing variant calling matches 
an array variant calling in terms of genotype identity irrespective of 
homoVhetero-zygosity. 



Additional file 7: Within-pair differences in chromatin accessibility 
as a function of the distance between the mutation and the center 
of the chromatin region. 

Additional file 8: Within-pair differences in chromatin accessibility 
according to the number of mutations per base pair as an estimate 
of the density of mutations relative to the size of the open 
chromatin region. 

Additional file 9: Observed-to-expected ratios of the substitution 
frequency of TFBS mutations and polymorphisms that were 
associated with chromatin discordance and inter-individual 
variation, respectively. 

Additional file 10: The relative enrichment of dinucleotides in 
TFBSs. The ratio of the number of the specified dinucleotides in TFBSs to 
the number in the surrounding chromatin regions was divided by the 
ratio for all the different dinucleotides. 

Additional file 11: The relative enrichment of dinucleotides in 
TFBSs. The ratio of the number of each dinucleotide in TFBSs to the 
number in the whole genome was divided by the ratio of the number 
of all the different dinucleotides in TFBSs to the number in the 
whole genome. 

Additional file 12: Differential methylation levels between B 
lymphocytes and other cell types (embryonic stem cells (ESC), 
hematopoietic progenitor cells (HPC), and neutrophils) for each 
TFBS CpG. The average of cytosine methylations in TFBSs was obtained 
and plotted. 

Additional file 13: The frequency of different types of 
polymorphisms arising in TFBSs as identified in this work (dark 
blue) and in the previous work by Degner et al. [13] (sky blue). 

Only changes from the reference homozygote were considered. 

Additional file 14: The frequency of TFBS dinucleotides containing 
polymorphisms identified in this work (dark blue) and in the 
previous work by Degner et al. [13] (sky blue). 

Additional file 15: Left: for the TFBS polymorphisms identified by 
Degner et al. [13], observed-to-expected ratios were obtained as 
the ratio of the relative enrichment of polymorphic dinucleotides in 
TFBSs to the relative enrichment of all dinucleotides in TFBSs. 

Right: for each TFBS CpG, differential methylation levels between B 
lymphocytes and other cell types (embryonic stem cells (ESC), 
hematopoietic progenitor cells (HPC), and neutrophils) were calculated. 
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